17 research outputs found

    Economy of Effort or Maximum Rate of Information? Exploring Basic Principles of Articulatory Dynamics

    Get PDF
    Economy of effort, a popular notion in contemporary speech research, predicts that dynamic extremes such as the maximum speed of articulatory movement are avoided as much as possible and that approaching the dynamic extremes is necessary only when there is a need to enhance linguistic contrast, as in the case of stress or clear speech. Empirical data, however, do not always support these predictions. In the present study, we considered an alternative principle: maximum rate of information, which assumes that speech dynamics are ultimately driven by the pressure to transmit information as quickly and accurately as possible. For empirical data, we asked speakers of American English to produce repetitive syllable sequences such as wawawawawa as fast as possible by imitating recordings of the same sequences that had been artificially accelerated and to produce meaningful sentences containing the same syllables at normal and fast speaking rates. Analysis of formant trajectories shows that dynamic extremes in meaningful speech sometimes even exceeded those in the nonsense syllable sequences but that this happened more often in unstressed syllables than in stressed syllables. We then used a target approximation model based on a mass-spring system of varying orders to simulate the formant kinematics. The results show that the kind of formant kinematics found in the present study and in previous studies can only be generated by a dynamical system operating with maximal muscular force under strong time pressure and that the dynamics of this operation may hold the solution to the long-standing enigma of greater stiffness in unstressed than in stressed syllables. We conclude, therefore, that maximum rate of information can coherently explain both current and previous empirical data and could therefore be a fundamental principle of motor control in speech production

    Pre-low raising in Cantonese and Thai: Effects of speech rate and vowel quantity

    Get PDF
    Although pre-low raising (PLR) has been extensively studied as a type of contextual tonal variation, its underlying mechanism is barely understood. This paper explored the effects of phonetic vs phonological duration on PLR in Cantonese and Thai and examined how speech rate and vowel quantity interact with its realization in these languages, respectively. The results for Cantonese revealed that PLR always occurred before a large falling excursion (i.e., high-low); in other tonal contexts, it was observed more often in faster speech. In the Thai corpus, PLR also occurred before large falling excursions, and there was more PLR in short vowels. These results are discussed in terms of possible accounts of the underlying mechanism of PLR

    Estimating underlying articulatory targets of Thai vowels by using deep learning based on generating synthetic samples from a 3D vocal tract model and data augmentation

    Get PDF
    Representation learning is one of the fundamental issues in modeling articulatory-based speech synthesis using target-driven models. This paper proposes a computational strategy for learning underlying articulatory targets from a 3D articulatory speech synthesis model using a bi-directional long short-term memory recurrent neural network based on a small set of representative seed samples. From a seeding set, a larger training set was generated that provided richer contextual variations for the model to learn. The deep learning model for acoustic-to-target mapping was then trained to model the inverse relation of the articulation process. This method allows the trained model to map the given acoustic data onto the articulatory target parameters which can then be used to identify the distribution based on linguistic contexts. The model was evaluated based on its effectiveness in mapping acoustics to articulation, and the perceptual accuracy of speech reproduced from the estimated articulation. The results indicate that the model can accurately imitate speech with a high degree of phonemic precision

    Computational modelling of double focus in American English

    Get PDF
    This study investigated how double focus in English statements and questions can be computationally modelled. PENTAtrainer2 was used to learn syllable-sized multi-functional targets from a corpus of 1960 English utterances, with controlled variations in lexical stress, focus, modality, and sentence length. The results showed that the learned targets could generate F0 contours close to the original. In particular, the asymmetry in the interaction between focus and modality was effectively simulated

    Toward invariant functional representations of variable surface fundamental frequency contours: Synthesizing speech melody via model-based stochastic learning

    Get PDF
    Variability has been one of the major challenges for both theoretical understanding and computer synthesis of speech prosody. In this paper we show that economical representation of variability is the key to effective modeling of prosody. Specifically, we report the development of PENTAtrainer—A trainable yet deterministic prosody synthesizer based on an articulatory–functional view of speech. We show with testing results on Thai, Mandarin and English that it is possible to achieve high-accuracy predictive synthesis of fundamental frequency contours with very small sets of parameters obtained through stochastic learning from real speech data. The first key component of this system is syllable-synchronized sequential target approximation—implemented as the qTA model, which is designed to simulate, for each tonal unit, a wide range of contextual variability with a single invariant target. The second key component is the automatic learning of function-specific targets through stochastic global optimization, guided by a layered pseudo-hierarchical functional annotation scheme, which requires the manual labeling of only the temporal domains of the functional units. The results in terms of synthesis accuracy demonstrate that effective modeling of the contextual variability is the key also to effective modeling of function-related variability. Additionally, we show that, being both theory-based and trainable (hence data-driven), computational systems like PENTAtrainer can serve as an effective modeling tool in basic research, with which the level of falsifiability in theory testing can be raised, and also a closer link between basic and applied research in speech science can be developed

    Explaining the PENTA mode: A reply to Arvaniti and Ladd (2009)

    Get PDF
    his paper presents an overview of the Parallel Encoding and Target Approximation (PENTA) model of speech prosody, in response to an extensive critique by Arvaniti & Ladd (2009). PENTA is a framework for conceptually and computationally linking communicative meanings to fine-grained prosodic details, based on an articulatory-functional view of speech. Target Approximation simulates the articulatory realisation of underlying pitch targets – the prosodic primitives in the framework. Parallel Encoding provides an operational scheme that enables simultaneous encoding of multiple communicative functions. We also outline how PENTA can be computationally tested with a set of software tools. With the help of one of the tools, we offer a PENTA-based hypothetical account of the Greek intonational patterns reported by Arvaniti & Ladd, showing how it is possible to predict the prosodic shapes of an utterance based on the lexical and postlexical meanings it conveys

    Model-based exploration of linking between vowel articulatory space and acoustic space

    Get PDF
    While the acoustic vowel space has been extensively studied in previous research, little is known about the high-dimensional articulatory space of vowels. The articulatory imaging techniques are limited to tracking only a few key articulators, leaving the rest of the articulators unmonitored. In the present study, we attempted to develop a detailed articulatory space obtained by training a 3D articulatory synthesizer to learn eleven British English vowels. An analysis-by-synthesis strategy was used to acoustically optimize vocal tract parameters that represent twenty articulatory dimensions. The results show that tongue height and retraction, larynx location and lip roundness are the most perceptually distinctive articulatory dimensions. Yet, even for these dimensions, there is a fair amount of articulatory overlap between vowels, unlike the fine-grained acoustic space. This method opens up the possibility of using modelling to investigate the link between speech production and perception

    Evoc-Learn - High quality simulation of early vocal learning

    Get PDF
    Evoc-Learn is a system for simulating early vocal learning of spoken language in ways that can overcome some of the major bottlenecks in vocal learning. The system consists of VocalTractLab, a geometrical three-dimensional vocal tract model for simulating aeroacoustics and articulatory dynamics, a coarticulation model for controlling the temporal dynamics of articulation, and a sensory feedback system for guiding the learning process. We will demonstrate each component of Evoc-Learn and show how they work together to simulate the learning of highly intelligible speech

    An agonist-antagonist pitch production model

    Get PDF
    Prosody is a phenomenon that is crucial for numerous fields of speech research, accenting the importance of having a robust prosody model. A class of intonation models based on the physiology of pitch pro- duction are especially attractive for their inherent multilingual support. These models rely on an accurate model of muscle activation. Tradi- tionally they have used the 2nd order spring-damper-mass (SDM) mus- cle model. However, recent research has shown that the SDM model is not sufficient for adequate modelling of the muscle dynamics. The 3rd order Hill type model offers a more accurate representation of mus- cle dynamics, but it has been shown to be underdamped when using physiologically plausible muscle parameters. In this paper we propose an agonist-antagonist pitch production (A2P2) model that both validates and gives insight behind the improved results of using higher-order crit- ically damped system models in intonation modelling
    corecore